Background

In this brief practical, we want to make sure we are all on the same page when it comes to the tidyverse. We are going to use photospectrometer readings of skin pigmentation of tadpoles to go through some examples of how to use the tidyverse in R to perform data manipulation.

The Tidyverse

Question: What is the Tidyverse?

  • A collection of R packages
  • The packages are focused on data science and they share an underlying grammar and design. In other words, they play well together!

Efficient data manipulation and visualization becomes increasingly important when working with large datasets. In genomics, we are often working with 10s of thousands or 100s of thousands of lines of data, and often more than one related table or dataframe, that all needs to be manipulated in R.

Installing and using the tidyverse

install.packages("tidyverse",repos ="https://cran.rstudio.com/")
## 
## The downloaded binary packages are in
##  /var/folders/vl/2k0yktf940g3r6rj5ktvwpxm0000gn/T//RtmpfJtTE4/downloaded_packages
library(tidyverse)

You should see that a call to load tidyverse library essentially just loads a number of "core" packages. It also tells you if there are any conflicting functions. For example, if you call filter(), it will use the filter function from the dplyr package, unless you specify to use the base stats package.

Experimental Background and data loading

As a quick re-cap, we have just finished a pigmentation plasticity experiment with Pelobates cultripes tadpoles. We reared a bunch of tadpoles from the same egg clutch (e.g. genetically similar individuals) under standardized laboratory conditions. The only condition we manipulated is whether tadpoles were reared in a black or a white bucket. After 40 or so days of exposure to these conditions, we anesthetized the tadpoles to be able to take photospectrometry measurements of the skin colour on their dorsum and ventrum, and to also a skin sample that we stored in RNAlater for an RNA extraction and gene expression analsis.

The photospetrometry measurements are stored as a comma-separated values table (.csv file), and we have a second file that contains information about what background the tadpoles were reared on. Lets load the first dataset using one of the tidyverse functions and lets explore it.

dat_spect<-read_csv("../data/minolta_data.csv")
dat_spect
## # A tibble: 34 × 41
##       id sample `360nm` `370nm` `380nm` `390nm` `400nm` `410nm` `420nm` `430nm`
##    <dbl> <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1     1 7_D       8.42    8.26    7.91    7.25    6.52    6.52    7.68    8.35
##  2     2 7_D       8.55    8.3     8.07    7.69    7.34    7.53    8.21    8.62
##  3     3 7_V      35.6    38.2    38.5    36.8    34.7    33.3    33.9    36.3 
##  4     4 7_V      35.6    37.6    37.7    35.8    33.8    32.6    33.2    35.3 
##  5     5 8_D       8.25    8.07    7.9     7.63    7.47    8.06    9.81   10.9 
##  6     6 8_D       8.21    7.99    7.75    7.43    7.21    7.76    9.19   10.2 
##  7     7 8_V      26.6    28.2    28.6    27.7    26.7    26.3    26.3    27.2 
##  8     8 8_V      25.3    26.5    26.8    26.1    25.3    24.9    25.1    26.0 
##  9     9 9_D       8.46    8.51    8.47    8.34    8.37    9.1    10.5    11.4 
## 10    10 9_D       8.72    8.58    8.39    8.18    8.14    8.78   10.2    11.0 
## # ℹ 24 more rows
## # ℹ 31 more variables: `440nm` <dbl>, `450nm` <dbl>, `460nm` <dbl>,
## #   `470nm` <dbl>, `480nm` <dbl>, `490nm` <dbl>, `500nm` <dbl>, `510nm` <dbl>,
## #   `520nm` <dbl>, `530nm` <dbl>, `540nm` <dbl>, `550nm` <dbl>, `560nm` <dbl>,
## #   `570nm` <dbl>, `580nm` <dbl>, `590nm` <dbl>, `600nm` <dbl>, `610nm` <dbl>,
## #   `620nm` <dbl>, `630nm` <dbl>, `640nm` <dbl>, `650nm` <dbl>, `660nm` <dbl>,
## #   `670nm` <dbl>, `680nm` <dbl>, `690nm` <dbl>, `700nm` <dbl>, …

When we call the object, we should notice that it has been stored as a tibble. This is essentially a dataframe, but it allows us to preview the first 10 rows, without calling the head() function and it lists all column names and data classes per column. Note: tibbles generally do not have rownames.

Question: From this preview, can you tell what kind of information the table contains?

  • It contains a mix of numeric and character variables
  • The first column is the ID of the measurement taken and the second is some sample information
  • All other columns are reflectance values at different wavelengths (per every 10nm)
  • Sample IDs are not unique: there are at least two repeated measurements per sample.

Lets get into some practical examples of how we can use the tidyverse to manipulate data in R.

1. The pipe

Essentially, pipes (%>% or |>) are special functions that allow you to take the output of one operation and use it as the input of another operation. Here is a very simple example:

We could use nrow() to count how many rows are in this table like so:

nrow(dat_spect)
## [1] 34

We can do the same operation using a pipe, like this:

dat_spect %>% nrow()
## [1] 34
## same as:
dat_spect %>%
  nrow()
## [1] 34

Note that for the functions that receive an input, you no longer have to specify the data argumen. With this basic example, it is hard to justify using the pipe, but as we continue on in this practical, you will see that pipes allow for a much more legible code structure, with less redundancy (e.g. no need to call the data object every time).

2. Select, filter and arrange

If you are mostly working with base R, then you will most likely rely heavily on subset(), Boolean operators TRUE/FALSE and row and column indices [row-number,column-number] to filter, select and sort your data. The tidyverse can do much of the same, but with a more intuitive set of functions.

Lets subset the same dataset to only include dorsal measurements (filter() rows), and then keep (select() columns) only the specimen id and reflectance measurements. Then we can sort (arrange()) it by the specimen id column.

dat_spect %>%
  filter(id>20) %>% # specify rows to keep
  select(-sample) %>% # specify columns to keep (or with a minus, which ones to drop)
  arrange(`360nm`) # specify row order based on column values
## # A tibble: 14 × 40
##       id `360nm` `370nm` `380nm` `390nm` `400nm` `410nm` `420nm` `430nm` `440nm`
##    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1    30    3.64    3.45    3.33    3.18    3.02    2.98    3       2.95    2.93
##  2    31    5.63    5.26    5.01    4.73    4.44    4.23    4.16    4.01    3.91
##  3    22    8.13    7.77    7.53    7.14    6.79    6.53    6.37    6.23    6.17
##  4    27    8.34    7.78    7.29    6.73    6.17    6.35    7.63    8.44    9.05
##  5    23    8.6     8.25    8.03    7.65    7.3     7.05    6.88    6.71    6.67
##  6    26    9.76    9.21    8.81    8.25    7.72    7.94    9.25   10.0    10.6 
##  7    25   13.8    13.7    13.5    12.9    12.3    11.8    11.6    11.7    12.1 
##  8    24   14.6    14.8    14.8    14.3    13.6    13.2    13.0    13.2    13.7 
##  9    32   15.9    15.3    14.8    14.0    13.1    12.6    12.3    12.4    13.0 
## 10    21   16.7    16.4    16.1    15.3    14.6    14.1    13.8    13.9    14.4 
## 11    33   18.7    18.1    17.4    16.4    15.4    14.8    14.6    14.9    15.8 
## 12    34   24.0    23.2    22.6    21.6    20.5    19.7    19.3    19.4    20.2 
## 13    28   39.9    41.1    40.9    38.9    36.1    34.9    36.5    38.8    42.0 
## 14    29   40.3    41.4    41      38.9    36.2    34.9    36.5    38.9    42.0 
## # ℹ 30 more variables: `450nm` <dbl>, `460nm` <dbl>, `470nm` <dbl>,
## #   `480nm` <dbl>, `490nm` <dbl>, `500nm` <dbl>, `510nm` <dbl>, `520nm` <dbl>,
## #   `530nm` <dbl>, `540nm` <dbl>, `550nm` <dbl>, `560nm` <dbl>, `570nm` <dbl>,
## #   `580nm` <dbl>, `590nm` <dbl>, `600nm` <dbl>, `610nm` <dbl>, `620nm` <dbl>,
## #   `630nm` <dbl>, `640nm` <dbl>, `650nm` <dbl>, `660nm` <dbl>, `670nm` <dbl>,
## #   `680nm` <dbl>, `690nm` <dbl>, `700nm` <dbl>, `710nm` <dbl>, `720nm` <dbl>,
## #   `730nm` <dbl>, `740nm` <dbl>

3. Mutate, separate and strings

Creating a new variable in the tidyverse uses the mutate() function. By looking at the sample column, we can see that there is some relevant data hidden in there. This variable seems to be composed of a sample number followed by D = Dorsal, V = Ventral. Let's see if we can extract that information and structure it into separate columns.

We can do this most easily with the canned separate() function:

dat_spect %>%
  separate(col=sample, into=c("specimen_id","position"), sep="_")
## # A tibble: 34 × 42
##       id specimen_id position `360nm` `370nm` `380nm` `390nm` `400nm` `410nm`
##    <dbl> <chr>       <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1     1 7           D           8.42    8.26    7.91    7.25    6.52    6.52
##  2     2 7           D           8.55    8.3     8.07    7.69    7.34    7.53
##  3     3 7           V          35.6    38.2    38.5    36.8    34.7    33.3 
##  4     4 7           V          35.6    37.6    37.7    35.8    33.8    32.6 
##  5     5 8           D           8.25    8.07    7.9     7.63    7.47    8.06
##  6     6 8           D           8.21    7.99    7.75    7.43    7.21    7.76
##  7     7 8           V          26.6    28.2    28.6    27.7    26.7    26.3 
##  8     8 8           V          25.3    26.5    26.8    26.1    25.3    24.9 
##  9     9 9           D           8.46    8.51    8.47    8.34    8.37    9.1 
## 10    10 9           D           8.72    8.58    8.39    8.18    8.14    8.78
## # ℹ 24 more rows
## # ℹ 33 more variables: `420nm` <dbl>, `430nm` <dbl>, `440nm` <dbl>,
## #   `450nm` <dbl>, `460nm` <dbl>, `470nm` <dbl>, `480nm` <dbl>, `490nm` <dbl>,
## #   `500nm` <dbl>, `510nm` <dbl>, `520nm` <dbl>, `530nm` <dbl>, `540nm` <dbl>,
## #   `550nm` <dbl>, `560nm` <dbl>, `570nm` <dbl>, `580nm` <dbl>, `590nm` <dbl>,
## #   `600nm` <dbl>, `610nm` <dbl>, `620nm` <dbl>, `630nm` <dbl>, `640nm` <dbl>,
## #   `650nm` <dbl>, `660nm` <dbl>, `670nm` <dbl>, `680nm` <dbl>, …

If we want more control, when working with strings (characters), the tidyverse leans heavily on the stringr package. Most functions are self explanatory and start with "str_". These are very powerful functions because they work with regular expressions (patterns). Lets try to make a new column, only with the anatomical position.

dat_spect %>%
  mutate(position=str_extract(string=sample, pattern="[A-Z]")) # mutate to create a new variable/column, [A-Z] to find only letters
## # A tibble: 34 × 42
##       id sample `360nm` `370nm` `380nm` `390nm` `400nm` `410nm` `420nm` `430nm`
##    <dbl> <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1     1 7_D       8.42    8.26    7.91    7.25    6.52    6.52    7.68    8.35
##  2     2 7_D       8.55    8.3     8.07    7.69    7.34    7.53    8.21    8.62
##  3     3 7_V      35.6    38.2    38.5    36.8    34.7    33.3    33.9    36.3 
##  4     4 7_V      35.6    37.6    37.7    35.8    33.8    32.6    33.2    35.3 
##  5     5 8_D       8.25    8.07    7.9     7.63    7.47    8.06    9.81   10.9 
##  6     6 8_D       8.21    7.99    7.75    7.43    7.21    7.76    9.19   10.2 
##  7     7 8_V      26.6    28.2    28.6    27.7    26.7    26.3    26.3    27.2 
##  8     8 8_V      25.3    26.5    26.8    26.1    25.3    24.9    25.1    26.0 
##  9     9 9_D       8.46    8.51    8.47    8.34    8.37    9.1    10.5    11.4 
## 10    10 9_D       8.72    8.58    8.39    8.18    8.14    8.78   10.2    11.0 
## # ℹ 24 more rows
## # ℹ 32 more variables: `440nm` <dbl>, `450nm` <dbl>, `460nm` <dbl>,
## #   `470nm` <dbl>, `480nm` <dbl>, `490nm` <dbl>, `500nm` <dbl>, `510nm` <dbl>,
## #   `520nm` <dbl>, `530nm` <dbl>, `540nm` <dbl>, `550nm` <dbl>, `560nm` <dbl>,
## #   `570nm` <dbl>, `580nm` <dbl>, `590nm` <dbl>, `600nm` <dbl>, `610nm` <dbl>,
## #   `620nm` <dbl>, `630nm` <dbl>, `640nm` <dbl>, `650nm` <dbl>, `660nm` <dbl>,
## #   `670nm` <dbl>, `680nm` <dbl>, `690nm` <dbl>, `700nm` <dbl>, …
# it looks like we have created the column, but it's at the very end. lets move around the columns
dat_spect %>%
  mutate(position=str_extract(string=sample, pattern="[A-Z]")) %>%
  select(position,everything())
## # A tibble: 34 × 42
##    position    id sample `360nm` `370nm` `380nm` `390nm` `400nm` `410nm` `420nm`
##    <chr>    <dbl> <chr>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 D            1 7_D       8.42    8.26    7.91    7.25    6.52    6.52    7.68
##  2 D            2 7_D       8.55    8.3     8.07    7.69    7.34    7.53    8.21
##  3 V            3 7_V      35.6    38.2    38.5    36.8    34.7    33.3    33.9 
##  4 V            4 7_V      35.6    37.6    37.7    35.8    33.8    32.6    33.2 
##  5 D            5 8_D       8.25    8.07    7.9     7.63    7.47    8.06    9.81
##  6 D            6 8_D       8.21    7.99    7.75    7.43    7.21    7.76    9.19
##  7 V            7 8_V      26.6    28.2    28.6    27.7    26.7    26.3    26.3 
##  8 V            8 8_V      25.3    26.5    26.8    26.1    25.3    24.9    25.1 
##  9 D            9 9_D       8.46    8.51    8.47    8.34    8.37    9.1    10.5 
## 10 D           10 9_D       8.72    8.58    8.39    8.18    8.14    8.78   10.2 
## # ℹ 24 more rows
## # ℹ 32 more variables: `430nm` <dbl>, `440nm` <dbl>, `450nm` <dbl>,
## #   `460nm` <dbl>, `470nm` <dbl>, `480nm` <dbl>, `490nm` <dbl>, `500nm` <dbl>,
## #   `510nm` <dbl>, `520nm` <dbl>, `530nm` <dbl>, `540nm` <dbl>, `550nm` <dbl>,
## #   `560nm` <dbl>, `570nm` <dbl>, `580nm` <dbl>, `590nm` <dbl>, `600nm` <dbl>,
## #   `610nm` <dbl>, `620nm` <dbl>, `630nm` <dbl>, `640nm` <dbl>, `650nm` <dbl>,
## #   `660nm` <dbl>, `670nm` <dbl>, `680nm` <dbl>, `690nm` <dbl>, …

For now, lets stick to the simpler separate() function, and lets also remove the ID column. We can save the original data object to apply the changes permanently. I generally don't recommend to overwrite the original object (i.e. saving with the same name) becuase this can lead to issues when going back up to earlier parts of code.

dat_clean<-dat_spect %>%
  separate(col=sample, into=c("specimen_id","position"), sep="_") %>%
  select(-id)

dat_clean
## # A tibble: 34 × 41
##    specimen_id position `360nm` `370nm` `380nm` `390nm` `400nm` `410nm` `420nm`
##    <chr>       <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 7           D           8.42    8.26    7.91    7.25    6.52    6.52    7.68
##  2 7           D           8.55    8.3     8.07    7.69    7.34    7.53    8.21
##  3 7           V          35.6    38.2    38.5    36.8    34.7    33.3    33.9 
##  4 7           V          35.6    37.6    37.7    35.8    33.8    32.6    33.2 
##  5 8           D           8.25    8.07    7.9     7.63    7.47    8.06    9.81
##  6 8           D           8.21    7.99    7.75    7.43    7.21    7.76    9.19
##  7 8           V          26.6    28.2    28.6    27.7    26.7    26.3    26.3 
##  8 8           V          25.3    26.5    26.8    26.1    25.3    24.9    25.1 
##  9 9           D           8.46    8.51    8.47    8.34    8.37    9.1    10.5 
## 10 9           D           8.72    8.58    8.39    8.18    8.14    8.78   10.2 
## # ℹ 24 more rows
## # ℹ 32 more variables: `430nm` <dbl>, `440nm` <dbl>, `450nm` <dbl>,
## #   `460nm` <dbl>, `470nm` <dbl>, `480nm` <dbl>, `490nm` <dbl>, `500nm` <dbl>,
## #   `510nm` <dbl>, `520nm` <dbl>, `530nm` <dbl>, `540nm` <dbl>, `550nm` <dbl>,
## #   `560nm` <dbl>, `570nm` <dbl>, `580nm` <dbl>, `590nm` <dbl>, `600nm` <dbl>,
## #   `610nm` <dbl>, `620nm` <dbl>, `630nm` <dbl>, `640nm` <dbl>, `650nm` <dbl>,
## #   `660nm` <dbl>, `670nm` <dbl>, `680nm` <dbl>, `690nm` <dbl>, …

4. Grouping and summarizing!

We have at least two measurements per sample. Going forward, we might want to summarise that into just a single mean value per sample. We can do that very easily using the group_by() and summarise() functions.

dat_clean %>%
  group_by(specimen_id, position) %>% # group by one or more columns
  summarise(mean=mean(`360nm`)) # use summarise to create a new variable, and apply a function to it
## # A tibble: 16 × 3
## # Groups:   specimen_id [8]
##    specimen_id position  mean
##    <chr>       <chr>    <dbl>
##  1 10          D         3.10
##  2 10          V        14.2 
##  3 11          D         3.88
##  4 11          V        15.8 
##  5 12          D         8.36
##  6 12          V        14.2 
##  7 21          D         9.05
##  8 21          V        40.1 
##  9 22          D         4.64
## 10 22          V        19.5 
## 11 7           D         8.48
## 12 7           V        35.6 
## 13 8           D         8.23
## 14 8           V        26.0 
## 15 9           D         8.59
## 16 9           V        26.8

We now have a new data frame with only 16 rows (single rows per specimen+position combination), and we have a new column called "mean" which is the mean of the 360nm reflectance values.

It is much more powerful however... we could for example apply more than one summarizing function:

dat_clean %>%
  group_by(specimen_id, position) %>% 
  summarise(mean=mean(`360nm`),
            sd=sd(`360nm`)) 
## # A tibble: 16 × 4
## # Groups:   specimen_id [8]
##    specimen_id position  mean      sd
##    <chr>       <chr>    <dbl>   <dbl>
##  1 10          D         3.10 0.672  
##  2 10          V        14.2  0.792  
##  3 11          D         3.88 1.18   
##  4 11          V        15.8  2.69   
##  5 12          D         8.36 0.332  
##  6 12          V        14.2  0.573  
##  7 21          D         9.05 1.00   
##  8 21          V        40.1  0.297  
##  9 22          D         4.64 1.41   
## 10 22          V        19.5  4.11   
## 11 7           D         8.48 0.0919 
## 12 7           V        35.6  0.00707
## 13 8           D         8.23 0.0283 
## 14 8           V        26.0  0.955  
## 15 9           D         8.59 0.184  
## 16 9           V        26.8  1.77

Or we can summarise many columns at once

dat_clean %>%
  group_by(specimen_id, position) %>% 
  summarise_all(mean) 
## # A tibble: 16 × 41
## # Groups:   specimen_id [8]
##    specimen_id position `360nm` `370nm` `380nm` `390nm` `400nm` `410nm` `420nm`
##    <chr>       <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 10          D           3.10    3.04    3.00    2.9     2.80    2.82    2.88
##  2 10          V          14.2    14.2    13.9    13.2    12.5    12.1    12.0 
##  3 11          D           3.88    3.70    3.57    3.36    3.16    3.06    3.06
##  4 11          V          15.8    15.5    15.1    14.4    13.6    13.2    12.8 
##  5 12          D           8.36    8.01    7.78    7.40    7.04    6.79    6.62
##  6 12          V          14.2    14.3    14.2    13.6    13.0    12.5    12.3 
##  7 21          D           9.05    8.50    8.05    7.49    6.94    7.14    8.44
##  8 21          V          40.1    41.2    40.9    38.9    36.2    34.9    36.5 
##  9 22          D           4.64    4.36    4.17    3.96    3.73    3.61    3.58
## 10 22          V          19.5    18.9    18.3    17.3    16.4    15.7    15.4 
## 11 7           D           8.48    8.28    7.99    7.47    6.93    7.02    7.94
## 12 7           V          35.6    37.9    38.1    36.3    34.3    32.9    33.5 
## 13 8           D           8.23    8.03    7.82    7.53    7.34    7.91    9.5 
## 14 8           V          26.0    27.4    27.7    26.9    26.0    25.6    25.7 
## 15 9           D           8.59    8.54    8.43    8.26    8.25    8.94   10.4 
## 16 9           V          26.8    27.8    27.9    26.8    25.6    25.0    25.0 
## # ℹ 32 more variables: `430nm` <dbl>, `440nm` <dbl>, `450nm` <dbl>,
## #   `460nm` <dbl>, `470nm` <dbl>, `480nm` <dbl>, `490nm` <dbl>, `500nm` <dbl>,
## #   `510nm` <dbl>, `520nm` <dbl>, `530nm` <dbl>, `540nm` <dbl>, `550nm` <dbl>,
## #   `560nm` <dbl>, `570nm` <dbl>, `580nm` <dbl>, `590nm` <dbl>, `600nm` <dbl>,
## #   `610nm` <dbl>, `620nm` <dbl>, `630nm` <dbl>, `640nm` <dbl>, `650nm` <dbl>,
## #   `660nm` <dbl>, `670nm` <dbl>, `680nm` <dbl>, `690nm` <dbl>, `700nm` <dbl>,
## #   `710nm` <dbl>, `720nm` <dbl>, `730nm` <dbl>, `740nm` <dbl>

This is what we set out to do, so lets save this data object and lets make sure we drop the grouping variable, just as good practice. We will also do one additional step.

dat_mean<-dat_clean %>%
  group_by(specimen_id, position) %>% 
  summarise_all(mean)  %>%
  ungroup()

dat_mean
## # A tibble: 16 × 41
##    specimen_id position `360nm` `370nm` `380nm` `390nm` `400nm` `410nm` `420nm`
##    <chr>       <chr>      <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1 10          D           3.10    3.04    3.00    2.9     2.80    2.82    2.88
##  2 10          V          14.2    14.2    13.9    13.2    12.5    12.1    12.0 
##  3 11          D           3.88    3.70    3.57    3.36    3.16    3.06    3.06
##  4 11          V          15.8    15.5    15.1    14.4    13.6    13.2    12.8 
##  5 12          D           8.36    8.01    7.78    7.40    7.04    6.79    6.62
##  6 12          V          14.2    14.3    14.2    13.6    13.0    12.5    12.3 
##  7 21          D           9.05    8.50    8.05    7.49    6.94    7.14    8.44
##  8 21          V          40.1    41.2    40.9    38.9    36.2    34.9    36.5 
##  9 22          D           4.64    4.36    4.17    3.96    3.73    3.61    3.58
## 10 22          V          19.5    18.9    18.3    17.3    16.4    15.7    15.4 
## 11 7           D           8.48    8.28    7.99    7.47    6.93    7.02    7.94
## 12 7           V          35.6    37.9    38.1    36.3    34.3    32.9    33.5 
## 13 8           D           8.23    8.03    7.82    7.53    7.34    7.91    9.5 
## 14 8           V          26.0    27.4    27.7    26.9    26.0    25.6    25.7 
## 15 9           D           8.59    8.54    8.43    8.26    8.25    8.94   10.4 
## 16 9           V          26.8    27.8    27.9    26.8    25.6    25.0    25.0 
## # ℹ 32 more variables: `430nm` <dbl>, `440nm` <dbl>, `450nm` <dbl>,
## #   `460nm` <dbl>, `470nm` <dbl>, `480nm` <dbl>, `490nm` <dbl>, `500nm` <dbl>,
## #   `510nm` <dbl>, `520nm` <dbl>, `530nm` <dbl>, `540nm` <dbl>, `550nm` <dbl>,
## #   `560nm` <dbl>, `570nm` <dbl>, `580nm` <dbl>, `590nm` <dbl>, `600nm` <dbl>,
## #   `610nm` <dbl>, `620nm` <dbl>, `630nm` <dbl>, `640nm` <dbl>, `650nm` <dbl>,
## #   `660nm` <dbl>, `670nm` <dbl>, `680nm` <dbl>, `690nm` <dbl>, `700nm` <dbl>,
## #   `710nm` <dbl>, `720nm` <dbl>, `730nm` <dbl>, `740nm` <dbl>

5. left join, right join, inner join, full join!

Any SQL people out there? Often, we have more than one dataset or table and we want to join them based on a reference variable. These "join" operations can go in different directions, depending on which table you want to complete:

You will most likely use left_join() most, where you wish to pull additional data from a second table into your first/primary table. Let's load the second data table that we have, which provides additional information on the experiment.

dat_exp<-read_csv("../data/experiment_data.csv")
dat_exp
## # A tibble: 8 × 2
##   specimen_id background
##         <dbl> <chr>     
## 1           7 white     
## 2           8 white     
## 3           9 white     
## 4          10 black     
## 5          11 black     
## 6          12 black     
## 7          21 white     
## 8          22 black

This table provides extra information on what background each of our 8 tadpole was reared on. We can use the specimen_id column to match this information with the sample column in our dataset

dat_mean %>%
  left_join(dat_exp)

You should get an error here! I have pruposefully left this error in here to demonstrate that the join functions will automatically look for one or more variables (columns) with the same name to use as the joining variable, but if these are of different data classes (in this case one is numeric and the other is a character string), it won't know what do to. We can fix that by just changing the data class for one of them. Lets do that, and lets save our new data object.

dat_final<-dat_mean %>% # create a new object
  mutate(specimen_id=as.numeric(specimen_id)) %>% # overwrite specimen_id as numeric variable
  left_join(dat_exp) %>% # pull over extra columns from the dat_exp table
  select(specimen_id, position, background, everything()) # arrange the columns

dat_final
## # A tibble: 16 × 42
##    specimen_id position background `360nm` `370nm` `380nm` `390nm` `400nm`
##          <dbl> <chr>    <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1          10 D        black         3.10    3.04    3.00    2.9     2.80
##  2          10 V        black        14.2    14.2    13.9    13.2    12.5 
##  3          11 D        black         3.88    3.70    3.57    3.36    3.16
##  4          11 V        black        15.8    15.5    15.1    14.4    13.6 
##  5          12 D        black         8.36    8.01    7.78    7.40    7.04
##  6          12 V        black        14.2    14.3    14.2    13.6    13.0 
##  7          21 D        white         9.05    8.50    8.05    7.49    6.94
##  8          21 V        white        40.1    41.2    40.9    38.9    36.2 
##  9          22 D        black         4.64    4.36    4.17    3.96    3.73
## 10          22 V        black        19.5    18.9    18.3    17.3    16.4 
## 11           7 D        white         8.48    8.28    7.99    7.47    6.93
## 12           7 V        white        35.6    37.9    38.1    36.3    34.3 
## 13           8 D        white         8.23    8.03    7.82    7.53    7.34
## 14           8 V        white        26.0    27.4    27.7    26.9    26.0 
## 15           9 D        white         8.59    8.54    8.43    8.26    8.25
## 16           9 V        white        26.8    27.8    27.9    26.8    25.6 
## # ℹ 34 more variables: `410nm` <dbl>, `420nm` <dbl>, `430nm` <dbl>,
## #   `440nm` <dbl>, `450nm` <dbl>, `460nm` <dbl>, `470nm` <dbl>, `480nm` <dbl>,
## #   `490nm` <dbl>, `500nm` <dbl>, `510nm` <dbl>, `520nm` <dbl>, `530nm` <dbl>,
## #   `540nm` <dbl>, `550nm` <dbl>, `560nm` <dbl>, `570nm` <dbl>, `580nm` <dbl>,
## #   `590nm` <dbl>, `600nm` <dbl>, `610nm` <dbl>, `620nm` <dbl>, `630nm` <dbl>,
## #   `640nm` <dbl>, `650nm` <dbl>, `660nm` <dbl>, `670nm` <dbl>, `680nm` <dbl>,
## #   `690nm` <dbl>, `700nm` <dbl>, `710nm` <dbl>, `720nm` <dbl>, …

6. Pivot longer and wider

When one variable is nested within another variable, this information can be stored as either a "wide" table, or a "long" table. (Think about multiple species in a genus, or multiple morphological measurements taken from a single animal).

One way of thinking about it is that wide tables have a single ID column and then many value columns, whereas a long table has many ID columns and only a single value column. Coming from Excel, and base R, we are probably more familiar with wide tables, but the tidyverse really likes long tables. To switch between them, we use pivot_longer() and pivot_wider().

Lets take another look at our final dataset

dat_final
## # A tibble: 16 × 42
##    specimen_id position background `360nm` `370nm` `380nm` `390nm` `400nm`
##          <dbl> <chr>    <chr>        <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
##  1          10 D        black         3.10    3.04    3.00    2.9     2.80
##  2          10 V        black        14.2    14.2    13.9    13.2    12.5 
##  3          11 D        black         3.88    3.70    3.57    3.36    3.16
##  4          11 V        black        15.8    15.5    15.1    14.4    13.6 
##  5          12 D        black         8.36    8.01    7.78    7.40    7.04
##  6          12 V        black        14.2    14.3    14.2    13.6    13.0 
##  7          21 D        white         9.05    8.50    8.05    7.49    6.94
##  8          21 V        white        40.1    41.2    40.9    38.9    36.2 
##  9          22 D        black         4.64    4.36    4.17    3.96    3.73
## 10          22 V        black        19.5    18.9    18.3    17.3    16.4 
## 11           7 D        white         8.48    8.28    7.99    7.47    6.93
## 12           7 V        white        35.6    37.9    38.1    36.3    34.3 
## 13           8 D        white         8.23    8.03    7.82    7.53    7.34
## 14           8 V        white        26.0    27.4    27.7    26.9    26.0 
## 15           9 D        white         8.59    8.54    8.43    8.26    8.25
## 16           9 V        white        26.8    27.8    27.9    26.8    25.6 
## # ℹ 34 more variables: `410nm` <dbl>, `420nm` <dbl>, `430nm` <dbl>,
## #   `440nm` <dbl>, `450nm` <dbl>, `460nm` <dbl>, `470nm` <dbl>, `480nm` <dbl>,
## #   `490nm` <dbl>, `500nm` <dbl>, `510nm` <dbl>, `520nm` <dbl>, `530nm` <dbl>,
## #   `540nm` <dbl>, `550nm` <dbl>, `560nm` <dbl>, `570nm` <dbl>, `580nm` <dbl>,
## #   `590nm` <dbl>, `600nm` <dbl>, `610nm` <dbl>, `620nm` <dbl>, `630nm` <dbl>,
## #   `640nm` <dbl>, `650nm` <dbl>, `660nm` <dbl>, `670nm` <dbl>, `680nm` <dbl>,
## #   `690nm` <dbl>, `700nm` <dbl>, `710nm` <dbl>, `720nm` <dbl>, …

Question: Is the dataset long or wide?

  • Although there are some "long" aspects to the table (we have more than one sample information column: specimen id, position, background) and these do not consist of unique rows.
  • If we are interested in the reflectance measurements, this table is primarily "wide", with each reflectance measurement for each wavelength in its own row.
## lets reshape it!
dat_long<-dat_final %>%
  pivot_longer(-c(specimen_id, position, background), names_to = "wavelength", values_to = "reflectance")

dat_long
## # A tibble: 624 × 5
##    specimen_id position background wavelength reflectance
##          <dbl> <chr>    <chr>      <chr>            <dbl>
##  1          10 D        black      360nm             3.10
##  2          10 D        black      370nm             3.04
##  3          10 D        black      380nm             3.00
##  4          10 D        black      390nm             2.9 
##  5          10 D        black      400nm             2.80
##  6          10 D        black      410nm             2.82
##  7          10 D        black      420nm             2.88
##  8          10 D        black      430nm             2.88
##  9          10 D        black      440nm             2.91
## 10          10 D        black      450nm             3.01
## # ℹ 614 more rows
## while we are at it, lets remove the units for the wavelengths
dat_long<-dat_long %>%
  mutate(wavelength=str_remove(wavelength, "nm") %>%
           as.numeric()) # notice that I am using a pipe within a funciton. this is also possible!

dat_long
## # A tibble: 624 × 5
##    specimen_id position background wavelength reflectance
##          <dbl> <chr>    <chr>           <dbl>       <dbl>
##  1          10 D        black             360        3.10
##  2          10 D        black             370        3.04
##  3          10 D        black             380        3.00
##  4          10 D        black             390        2.9 
##  5          10 D        black             400        2.80
##  6          10 D        black             410        2.82
##  7          10 D        black             420        2.88
##  8          10 D        black             430        2.88
##  9          10 D        black             440        2.91
## 10          10 D        black             450        3.01
## # ℹ 614 more rows

Although this may seem trivial or even unnecessary at first glance, it is a hugely important data transformation technique, especially in combination with group_by() and plotting.

Visualizing data with ggplot2

A package of the tidverse that many of you may know already is ggplot2. To build plots using ggplot takes three general steps.

1. Create a new ggplot object

# to build a plot we have to define two basic aspects
# 1. what is our dataset? - defined by  "data="
# 2. what variables do we want to plot? - defined by mapping the aesthetics, or "mapping=aes()"
ggplot(data=dat_long,
       mapping=aes(x=wavelength, y=reflectance))

2. Add plot layers

Once the plot as been created, you can add any plot layer you like, using geoms. For example, the x and y data as points. Note: to add plot elements or layers, we use +, which operate similarly to pipes.

dat_long %>% # we can also pipe the data to the plot. This is useful if we want to manipulate it first
  filter(specimen_id==10) %>% # keep only one sample (10)
  ggplot(mapping=aes(x=wavelength, y=reflectance)) +
  geom_point()

Different geoms allow for different data visualisation. Let's draw lines instead of dots.

# line graph
dat_long %>% # we can also pipe the data to the plot. This is useful if we want to manipulate it first
  filter(specimen_id==10) %>% # keep only one sample (10)
  ggplot(mapping=aes(x=wavelength, y=reflectance, group=position)) + # add grouping to have different lines for dorsal and ventral measurements
  geom_line()

3. Styling visualizations

Different styling can be added at different parts of the build.

Adding a fixed colour is done outside the aes()

dat_long %>% # we can also pipe the data to the plot. This is useful if we want to manipulate it first
  filter(specimen_id==10) %>% # keep only one sample (10)
  ggplot(mapping=aes(x=wavelength, y=reflectance, group=position, color=position)) + # add grouping to have different lines for dorsal and ventral measurements
  geom_line()

General theme elements can be manipulated both with canned theme functions, or manually

# line graph
dat_long %>% # we can also pipe the data to the plot. This is useful if we want to manipulate it first
  filter(specimen_id==10) %>% # keep only one sample (10)
  ggplot(mapping=aes(x=wavelength, y=reflectance, color=position)) + # colour replaces grouping for have different lines for dorsal and ventral measurements
  geom_line() +
  ## add a title
  ggtitle("Spectral reflectances of dorsal and ventral skin") +
  ## apply a canned theme
  theme_classic()

Question: What does this plot tell us?

  • The dorsal skin essentially reflects no light (i.e. it is dark)
  • The ventral skin reflects more light overall (i.e. area under the curve is greater)
  • The ventral skin has higher reflectances at higher wavelengths. Ventral skin is therefore in the yellow-red range, not in the violet-blue range.

4. Faceting

So far, we have only plotted the dorsal and ventral spectra of a single specimen. We could remove the filter, and plot all samples at once.

dat_long %>%
  ggplot(mapping=aes(x=wavelength, y=reflectance, group=paste(specimen_id, background, position))) + 
  geom_line() +
  theme_classic()

However, here we had to do a rather awkward grouping by pasting the three variables together and in general, it is a little difficult to make out anything here. GGplot is great for organizing multiple plots for groups of data that share one or both axes. This is done with facet_wrap(). Let's give that a go.

dat_long %>%
  ggplot(mapping=aes(x=wavelength, y=reflectance, group=specimen_id, color=background)) + # add grouping to have different lines for dorsal and ventral measurements
  geom_line() +
  facet_wrap(~position) +
  theme_classic()

We now have the dorsal and ventral spectra split into the two plots that share the same y-axis and we can identify the samples from the different experimental treatments with colours. We could facet this plot two ways as well, using facet_grid().

dat_long %>%
  ggplot(mapping=aes(x=wavelength, y=reflectance, group=specimen_id)) + # add grouping to have different lines for dorsal and ventral measurements
  geom_line() +
  facet_grid(position~background) +
  theme_classic()

As one final exercise, lets use some of ggplot's built in summarizing functions to plot a smoothed line through each

dat_long %>%
  ggplot(mapping=aes(x=wavelength, y=reflectance, color=background)) + # drop the grouping variable here
  geom_smooth(stat="smooth", method="loess", span=0.2) + # fit loess smoothing with a more relaxed smoothing span
  facet_wrap(~position) +
  theme_classic()

Question: What can we conclude from this?

  • The dorsal skin is always darker (reflects less) than the ventral skin, regardless of treatments
  • For both dorsal and ventral skins, tadpoles reared on black backgrounds are darker (reflect less)
  • the shapes of the curves are more or less the same (higher reflectance of longer wavelengths) and so the hue is fairly similar across treatments, only the overall brightness (the area under the curves) is different.

Exercise

Question: If we assume that the total area under the curve represents the overall brightness, plot the reaction norms of dorsal and ventral brightnesses for tadpoles from black versus white backgrounds. You can show these as boxplots, or if you can, show the mean reaction norms with standard deviation error bars.

# calculate area under the curve for each specimen
dat_area<-dat_long %>%
  group_by(specimen_id, position, background) %>%
  summarise(brightness=sum(reflectance)) %>%
  ungroup()

# plot boxplot
dat_area %>%
  ggplot(aes(x=background, y=brightness, fill=position)) +
  geom_boxplot()

# calculate means and sd and plot as reaction norms
dat_area%>%
  group_by(position, background) %>%
  mutate(mean_brightness=mean(brightness),
         sd_brightness=sd(brightness)) %>%
  ggplot(aes(x=background, color=position)) +
  geom_point(aes(y=brightness)) +
  geom_point(aes(y=mean_brightness), size=4) +
  geom_line(aes(y=mean_brightness, group=position)) +
  geom_errorbar(aes(ymin=mean_brightness-sd_brightness, ymax=mean_brightness+sd_brightness), width=0.1) +
  theme_bw()

Final comments:

  • Switching to the tidyverse can be a little daunting at first and may seem redundant. Many things can be done in base R. However, it is a powerful too for complex data organization and manipulation.
  • Learn by doing! As with any programming language, the best way to learn is to just get your hands dirty. Your regular google search should be "how do i ________ using the tidyverse?".
  • Don't be afraid to use ChatGPT, but try it yourself first, ask for specific bits of code, not everything, and make an effort to understand what each step of the code is doing.